Abstract: Duplicate detection is a critical task for any organization's database. Duplicates are multiple representations of the same real-world entity or object that differ in structure or format. They can occur in relational data, in complex data, and in hierarchical data such as XML. Much prior work has addressed duplicate detection in relational data, but attention has recently shifted to XML, because XML is a popular format for storing data and is used extensively for data exchange between organizations. In this paper we present an extensive literature survey on this topic and propose a duplicate detection method that combines ideas from existing work with original ideas of our own. In addition to improving efficiency and effectiveness, the method also accounts for typographical errors when comparing two XML elements. To test the correctness of our method, we compare it with an existing duplicate detection system, with a focus on how it achieves higher precision and recall on the various datasets we used.

Keywords: Duplicate detection, record linkage, entity resolution, XML, Bayesian networks, data cleaning, optimization